Before answering any questions, let us set up our framework and prepare the data. In these code chunks, we:

  1. Load the necessary R packages,
  2. Import the data from the CSV file,
  3. Adjust missing column names,
  4. Order our data,
  5. Format dates correctly,
  6. Check which columns contain missing values (NA),
  7. Find a solution for missing dates,
  8. Omit the remaining rows which contain missing values.
library(data.table) # data tables, fread
library(lubridate)  # dates
library(ggplot2)    # plotting
library(knitr)      # r markdown
library(kableExtra) # kable parameters
library(ggmap)      # maps visualizations
library(osmdata)    # open street maps
library(sp)         # spatial data
library(geosphere)  # compute distances from coordinates
library(qcc)        # pareto charts
library(tidyr)      # pivot_longer function

LaptopSales <- fread("LaptopSales_red.csv")
colnames(LaptopSales)[1] <- "TransID" # First column had no name, "V1" was assigned by default, changing to "TransID", the ID number of each transaction.
setkey(LaptopSales, TransID) # Ordering the data table by TransID. Note that this also orders the transactions by chronological order.
str(LaptopSales) # Checking the structure of the data. The Date column is not formatted correctly.
## Classes 'data.table' and 'data.frame':   148786 obs. of  17 variables:
##  $ TransID               : int  2 7 9 12 13 14 17 19 20 22 ...
##  $ Date                  : chr  "1/1/2008 0:02" "1/1/2008 0:14" "1/1/2008 0:24" "1/1/2008 1:03" ...
##  $ Configuration         : int  320 75 70 359 300 363 310 320 299 216 ...
##  $ Customer.Postcode     : chr  "SW4 0JL" "SE25 4LX" "W3 7JT" "W1T 1DG" ...
##  $ Store.Postcode        : chr  "SW12 9HD" "SE8 3JD" "SW18 1NN" "NW5 2QH" ...
##  $ Retail.Price          : int  545 465 455 665 440 565 505 545 490 535 ...
##  $ Screen.Size..Inches.  : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ Battery.Life..Hours.  : int  6 4 4 6 6 6 6 6 6 5 ...
##  $ RAM..GB.              : int  1 2 2 2 1 2 1 1 1 2 ...
##  $ Processor.Speeds..GHz.: num  2 2 2 2 1.5 2 2 2 1.5 2 ...
##  $ Integrated.Wireless.  : chr  "No" "No" "Yes" "Yes" ...
##  $ HD.Size..GB.          : int  300 80 120 300 80 80 120 300 80 300 ...
##  $ Bundled.Applications. : chr  "No" "Yes" "No" "Yes" ...
##  $ customer.X            : int  529240 534575 520898 529584 518891 530451 531065 530351 533256 541936 ...
##  $ customer.Y            : int  175537 168236 180071 181554 177662 183067 182190 180907 179711 183350 ...
##  $ store.X               : int  528739 537175 525155 529248 519585 528924 529902 534057 529902 535652 ...
##  $ store.Y               : int  173080 177885 175180 185213 177640 178440 179641 179682 179641 182961 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "TransID"
LaptopSales <- LaptopSales[, Date := as.POSIXct(Date, format ="%m/%d/%Y %H:%M", tz = "UTC")]
str(LaptopSales) # Changing the format of the dates and checking the structure again to make sure it is in order.
## Classes 'data.table' and 'data.frame':   148786 obs. of  17 variables:
##  $ TransID               : int  2 7 9 12 13 14 17 19 20 22 ...
##  $ Date                  : POSIXct, format: "2008-01-01 00:02:00" "2008-01-01 00:14:00" ...
##  $ Configuration         : int  320 75 70 359 300 363 310 320 299 216 ...
##  $ Customer.Postcode     : chr  "SW4 0JL" "SE25 4LX" "W3 7JT" "W1T 1DG" ...
##  $ Store.Postcode        : chr  "SW12 9HD" "SE8 3JD" "SW18 1NN" "NW5 2QH" ...
##  $ Retail.Price          : int  545 465 455 665 440 565 505 545 490 535 ...
##  $ Screen.Size..Inches.  : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ Battery.Life..Hours.  : int  6 4 4 6 6 6 6 6 6 5 ...
##  $ RAM..GB.              : int  1 2 2 2 1 2 1 1 1 2 ...
##  $ Processor.Speeds..GHz.: num  2 2 2 2 1.5 2 2 2 1.5 2 ...
##  $ Integrated.Wireless.  : chr  "No" "No" "Yes" "Yes" ...
##  $ HD.Size..GB.          : int  300 80 120 300 80 80 120 300 80 300 ...
##  $ Bundled.Applications. : chr  "No" "Yes" "No" "Yes" ...
##  $ customer.X            : int  529240 534575 520898 529584 518891 530451 531065 530351 533256 541936 ...
##  $ customer.Y            : int  175537 168236 180071 181554 177662 183067 182190 180907 179711 183350 ...
##  $ store.X               : int  528739 537175 525155 529248 519585 528924 529902 534057 529902 535652 ...
##  $ store.Y               : int  173080 177885 175180 185213 177640 178440 179641 179682 179641 182961 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "TransID"
# To check which columns contain NA values, we apply is.na() to the data, which returns booleans, and then sum each column: a column sum of 0 means no NA values, while a sum > 0 means the column contains NA values. We can see that the columns with missing values are Date, Retail.Price, store.X and store.Y.
colSums(is.na(LaptopSales))
##                TransID                   Date          Configuration 
##                      0                    110                      0 
##      Customer.Postcode         Store.Postcode           Retail.Price 
##                      0                      0                   6656 
##   Screen.Size..Inches.   Battery.Life..Hours.               RAM..GB. 
##                      0                      0                      0 
## Processor.Speeds..GHz.   Integrated.Wireless.           HD.Size..GB. 
##                      0                      0                      0 
##  Bundled.Applications.             customer.X             customer.Y 
##                      0                      0                      0 
##                store.X                store.Y 
##                     85                     85
# The data is ordered by TransID, which also puts it in chronological order. If a date is NA, the missing value must fall between the dates of the previous and following rows. Let us compute the time elapsed between the rows surrounding each NA date:
x = c()
for (i in 1:nrow(LaptopSales)){
  if (is.na(LaptopSales$Date[i])){ # Checking if date is NA
    a = LaptopSales$Date[i+1] - LaptopSales$Date[i-1] #computing the elapsed time
    x = append(x, a) # adding the elapsed time to the empty vector we created
  }
}
max(x) # finding the maximum time elapsed
## Time difference of 3060 secs
# Since the maximum time elapsed is 3060 seconds, i.e. 51 minutes, this allows us to pinpoint the missing transactions in time quite precisely, considering the table spans the whole year of 2008. We decided to impute each NA date with the mean of the dates of the previous and following transactions:

for (i in 1:nrow(LaptopSales)){
  if (is.na(LaptopSales$Date[i])){
    LaptopSales$Date[i] = mean(c(LaptopSales$Date[i-1],LaptopSales$Date[i+1]))
  }
}

# For the other three columns which contain missing values, we decided to take them out of the data table.
LaptopSales <- na.omit(LaptopSales)
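To make the effect of na.omit() concrete, here is a minimal sketch on a toy data.table (the values are invented for illustration): na.omit() drops every row containing at least one NA, in any column.

```r
library(data.table)

# Toy table with NAs scattered across different columns
dt <- data.table(id    = 1:4,
                 price = c(545, NA, 455, 665),
                 store = c("A", "B", NA, "D"))

colSums(is.na(dt))   # id: 0, price: 1, store: 1
clean <- na.omit(dt) # drops rows 2 and 3, each containing an NA
nrow(clean)          # 2 rows remain
```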

1 Price questions

1.1 At what price are the laptops actually selling?

Let us first compute the mean and median values:

meanPrice = round(mean(LaptopSales$Retail.Price), digits = 2) # mean of laptop prices
medianPrice = median(LaptopSales$Retail.Price) # median of laptop prices

The mean value of laptop prices in 2008 is 508.12 £ and the median is 500 £. These values are quite close.

Let us explore in more detail the retail prices. A boxplot will give us an idea of laptop prices in 2008.

boxplot(LaptopSales$Retail.Price, xlab = "Boxplot of laptop retail prices in 2008", ylab = "Retail price")

We can see that most laptop prices range between 400 £ and 600 £ and without outliers, the broader range is between 200 £ and 800 £.
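The outlier cutoffs a boxplot draws follow the usual 1.5 × IQR rule; a minimal sketch on simulated prices (the seed and normal-distribution parameters are illustrative, not the real data):

```r
set.seed(1)                          # reproducible toy data, not the actual prices
prices <- round(rnorm(1000, mean = 500, sd = 80))

q <- quantile(prices, c(0.25, 0.75)) # first and third quartiles
iqr <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr          # points outside [lower, upper] are
upper <- q[[2]] + 1.5 * iqr          # drawn individually as outliers

# boxplot.stats() applies the same 1.5 * IQR rule (on Tukey's hinges,
# which can differ marginally from quantile())
stats <- boxplot.stats(prices)
stats$out                            # the outlier points of the boxplot
```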

1.2 Does price change with time?

To get a better idea of how these prices evolve with respect to time, we can plot means and medians of retail prices monthly, weekly and daily.

First, we compute all the necessary means and medians.

# Aggregating dates by month, week and day
LaptopSales <- LaptopSales[, Month := floor_date(Date, "month")]
LaptopSales <- LaptopSales[, Week := floor_date(Date, "week")]
LaptopSales <- LaptopSales[, Day := floor_date(Date, "day")]

# Keeping only the relevant columns
TimeLaptopSales <- LaptopSales[, list(Date, Month, Week, Day, Retail.Price)]

# Computing monthly means and medians
TimeLaptopSales <- TimeLaptopSales[, MonthMean := mean(Retail.Price), by = Month]
TimeLaptopSales <- TimeLaptopSales[, MonthMedian := median(Retail.Price), by = Month]

# Computing weekly means and medians
TimeLaptopSales <- TimeLaptopSales[, WeekMean := mean(Retail.Price), by = Week]
TimeLaptopSales <- TimeLaptopSales[, WeekMedian := median(Retail.Price), by = Week]

# Computing daily means and medians
TimeLaptopSales <- TimeLaptopSales[, DayMean := mean(Retail.Price), by = Day]
TimeLaptopSales <- TimeLaptopSales[, DayMedian := median(Retail.Price), by = Day]

Then, we plot the six respective graphs.

# Plotting the two graphs with options to draw a line, scale the variables appropriately, label the axes and adjust background colors.
ggplot(TimeLaptopSales, aes(x = Month, y = MonthMean)) + geom_line() +
  scale_x_datetime(date_labels = "%b", date_breaks  ="1 month") +
  scale_y_continuous(limits = c(440, 550), breaks = seq(440, 550, 10)) +
  labs(x = "Year 2008", y = "Monthly mean of retail prices") +
  theme_light()

ggplot(TimeLaptopSales, aes(x = Month, y = MonthMedian)) + geom_line() +
  scale_x_datetime(date_labels = "%b", date_breaks  ="1 month") +
  scale_y_continuous(limits = c(440, 550), breaks = seq(440, 550, 10)) +
  labs(x = "Year 2008", y = "Monthly median of retail prices") +
  theme_light()

# Plotting the two graphs with options to draw a line, scale the variables appropriately, label the axes and adjust background colors.
ggplot(TimeLaptopSales, aes(x = Week, y = WeekMean)) + geom_line() +
  scale_x_datetime(date_labels = "%b", date_breaks  ="1 month") +
  scale_y_continuous(limits = c(440, 550), breaks = seq(440, 550, 10)) +
  labs(x = "Year 2008", y = "Weekly mean of retail prices") +
  theme_light()

ggplot(TimeLaptopSales, aes(x = Week, y = WeekMedian)) + geom_line() +
  scale_x_datetime(date_labels = "%b", date_breaks  ="1 month") +
  scale_y_continuous(limits = c(440, 550), breaks = seq(440, 550, 10)) +
  labs(x = "Year 2008", y = "Weekly median of retail prices") +
  theme_light()

# Plotting the two graphs with options to draw a line, scale the variables appropriately, label the axes and adjust background colors.
ggplot(TimeLaptopSales, aes(x = Day, y = DayMean)) + geom_line() +
  scale_x_datetime(date_labels = "%b", date_breaks  ="1 month") +
  scale_y_continuous(limits = c(430, 550), breaks = seq(430, 550, 10)) +
  labs(x = "Year 2008", y = "Daily mean of retail prices") +
  theme_light()

ggplot(TimeLaptopSales, aes(x = Day, y = DayMedian)) + geom_line() +
  scale_x_datetime(date_labels = "%b", date_breaks  ="1 month") +
  scale_y_continuous(limits = c(430, 550), breaks = seq(430, 550, 10)) +
  labs(x = "Year 2008", y = "Daily median of retail prices") +
  theme_light()

These plots give us a more precise understanding of laptop prices. We can observe that by December 2008, prices are back around the mean and median values they had in January 2008. If we had data for other years, we could infer whether this pattern is cyclical.

Moreover, we can see that prices tend to be lower in the first five months of 2008 and higher in the seven remaining months, with a drastic jump in prices from May to June.

Looking at the daily data in particular, we can see that for the first, third and fourth quarters of 2008, prices are stable during the first two months and then drop by approximately 30 £ in the third month. During the second quarter of 2008, prices are also stable for the first two months, but are followed by an increase of approximately 60 £. Between each quarter, prices rise by approximately 20 £.

Comparing mean and median plots, we can see that they are globally similar, with small differences. The mean gives more variation in each plot than the median, which tends to smooth out the curves more. For the following steps, we will continue working with the mean.
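The smoothing effect of the median comes from its robustness to extreme values; a minimal illustration on invented prices:

```r
prices <- c(480, 490, 500, 510, 520)
mean(prices)        # 500
median(prices)      # 500

# One unusually expensive sale shifts the mean but barely moves the median
prices_out <- c(prices, 900)
mean(prices_out)    # about 566.7
median(prices_out)  # 505
```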

1.3 Are prices consistent across retail outlets?

As we will see below, prices are not consistent across retail outlets, with two different patterns emerging.

First, we identify how many stores there are by creating StoreLocations, a data table which contains all stores coordinates:

StoreLocations <- LaptopSales[, list(Store.Postcode, store.X, store.Y)] # Creating table of store coordinates
StoreLocations <- StoreLocations[!duplicated(StoreLocations),] # removing all duplicate rows
                                        
kable_styling(kable(StoreLocations, col.names = c("Store Postcode", "X Coordinate", "Y Coordinate"), align = "lcc"), full_width = FALSE) # displaying store coordinates table
Store Postcode X Coordinate Y Coordinate
SW12 9HD 528739 173080
SE8 3JD 537175 177885
SW18 1NN 525155 175180
NW5 2QH 529248 185213
W4 3PH 519585 177640
SW1V 4QQ 528924 178440
SW1P 3AU 529902 179641
SE1 2BN 534057 179682
E2 0RY 535652 182961
CR7 8LE 532714 168302
N3 1DH 525109 190628
W10 6HQ 524190 181567
KT2 5AU 517917 170243
E7 8NW 541428 184515
N17 6QA 533788 189994

Since no postcode appears more than once in our table, we can aggregate with respect to Store.Postcode.

Let us plot a barchart of mean sale prices in 2008, by store.

StoreLaptopSales <- LaptopSales[, list(Date, Month, Week, Day, Retail.Price, Store.Postcode)] # keeping only relevant columns
StoreLaptopSales <- StoreLaptopSales[, StoreMean := mean(Retail.Price), by = Store.Postcode] # computing mean price by store

StoreBarplotData <- StoreLaptopSales[, list(Store.Postcode, StoreMean)] # creating a data table for the barchart, with the only two relevant columns
StoreBarplotData <- StoreBarplotData[!duplicated(StoreBarplotData),] # getting rid of duplicate rows
StoreBarplotData <- StoreBarplotData[, CharStoreMean := as.character(round(StoreMean,2))] # rounding the store mean and passing it as a character for ggplot to interpret it as a series of discrete values, giving one bar per store

# plotting the barchart with aesthetic tweaking
ggplot(StoreBarplotData, aes(x =CharStoreMean, y = StoreMean, fill = CharStoreMean)) +
  geom_bar(stat = "identity") + theme_light() +
  theme(axis.text.x = element_blank()) +
  labs(x = "Stores", y = "Annual mean retail price for 2008", fill = "Annual mean")

# We want to reorder the StoreLocations table by ascending mean value for convenience. We do so by merging StoreBarplotData and StoreLocations

StoreLocations <- merge(StoreLocations, StoreBarplotData, by = "Store.Postcode") # merging
setkey(StoreLocations, StoreMean) # ordering by ascending mean value
StoreLocations <- StoreLocations[, list(Store.Postcode, store.X, store.Y)] # keeping only the 3 relevant columns

We can clearly see a difference between the first five stores, with means around 470 £, and the other stores, with means around 520 £.

Now, let us plot monthly, weekly and daily means like before, but by store. This will allow us to see if there are differences in behavior between stores with lower and higher means.

StoreLaptopSales <- StoreLaptopSales[, StoreMean := as.character(round(StoreMean, 2))] # like before, we round and pass the mean as a string so ggplot recognizes it as a set of discrete values. We will thus get 15 different lines plotted.

StoreLaptopSales <- StoreLaptopSales[, StoreMonthMean := mean(Retail.Price), by = list(Store.Postcode, Month)] # computing monthly means by store
StoreLaptopSales <- StoreLaptopSales[, StoreWeekMean := mean(Retail.Price), by = list(Store.Postcode, Week)] # computing weekly means by store
StoreLaptopSales <- StoreLaptopSales[, StoreDayMean := mean(Retail.Price), by = list(Store.Postcode, Day)] # computing daily means by store

# Plotting monthly, weekly and daily means by store with customization options similar to the ones used before.
ggplot(StoreLaptopSales, aes(x = Month, y = StoreMonthMean, color = StoreMean)) + geom_line() +
  labs(x = "Year 2008", y = "Monthly mean of retail prices", color = "Annual mean of store") +
  theme_light() + scale_x_datetime(date_labels = "%b", date_breaks  ="1 month")

ggplot(StoreLaptopSales, aes(x = Week, y = StoreWeekMean, color = StoreMean)) + geom_line() +
  labs(x = "Year 2008", y = "Weekly mean of retail prices", color = "Annual mean of store") +
  theme_light() + scale_x_datetime(date_labels = "%b", date_breaks  ="1 month")

ggplot(StoreLaptopSales, aes(x = Day, y = StoreDayMean, color = StoreMean)) + geom_line() +
  labs(x = "Year 2008", y = "Daily mean of retail prices", color = "Annual mean of store") +
  theme_light() + scale_x_datetime(date_labels = "%b", date_breaks  ="1 month")

We can clearly see that stores with low annual means behave very differently from the other stores. Their prices vary strongly, with large drops in the third month of each quarter, followed by steep increases at the beginning of the following quarter.

In contrast, stores with high annual means show a slow, steady decrease in prices, except for a sudden sharp jump from May to June.

Moreover, the local maxima of the low-annual-mean stores' retail prices always coincide with the retail prices of the high-annual-mean stores.

1.4 How does price change with configuration?

We will see that configurations are split into six categories with similar mean price patterns. Let us plot the mean retail price with respect to the configuration number.

ConfigLaptopSales <- LaptopSales[, list(Date, Month, Week, Day, Store.Postcode, Configuration, Retail.Price)] # keeping only the relevant columns
ConfigLaptopSales <- ConfigLaptopSales[, ConfigMean := mean(Retail.Price), by = Configuration] # computing the mean price by configuration number

# plotting with colors to underline the six similar patterns (using cut)
ggplot(ConfigLaptopSales, aes(x = Configuration, y = ConfigMean, color = cut(Configuration, breaks = c(0, 144, 288, 432, 576, 720, 864)))) + geom_line() + theme_light() + 
  scale_x_continuous(limits = c(0, 864), breaks = seq(0, 864, 100)) +
  labs(x = "Configuration Number", y = "Mean retail price", color = "Configuration levels")

We can clearly see that configurations are split into six even categories, containing 144 configuration numbers each. Each category has the same pattern of retail price, and each time the pattern is repeated, it is slightly translated vertically, thus increasing the general price range of the category.
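Since each of the six categories spans 144 consecutive configuration numbers, the category of any configuration can also be recovered arithmetically, without cut(); a small sketch (config_level is a hypothetical helper, assuming configuration numbers run from 1 to 864 as in the plot):

```r
# Category 1 covers configurations 1-144, category 2 covers 145-288, etc.
config_level <- function(configuration) ((configuration - 1L) %/% 144L) + 1L

config_level(75)   # 1
config_level(320)  # 3
config_level(864)  # 6
```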

2 Location questions

2.1 Where are the stores and customers located?

Stores and customers are located in the United Kingdom. We will create a map to display their exact locations.

We begin by using the already provided code to create a map of the London area:

# Get a map with open street map
map_os <- get_map(getbb("London"), source = "osm")
ggmap(map_os)

The provided code also allows us to convert store coordinates from EPSG:27700 (United Kingdom coordinate system) to EPSG:4326 (commonly used standard for GIS data).

# We already have the StoreLocations data created previously

# Create SpatialPointsDataframe
StoreLocations_SP <- SpatialPointsDataFrame( 
  data = data.frame(StoreLocations$Store.Postcode), # data, stores names (postcodes)
  coords = StoreLocations[, c("store.X", "store.Y")], # coordinates, x for "Easting", y for "Northing"
  proj4string = CRS( "+init=epsg:27700" ) ) # proj4string of the coordinates, assign CRS to data

# Transform coordinates
StoreLocations_SP_LL <- spTransform(StoreLocations_SP, CRS("+init=epsg:4326"))

# Transform to datatable
StoreLocations_LL <- data.frame(StoreLocations_SP_LL)[,c(1:3)]      # keep first 3 columns
colnames(StoreLocations_LL) <- c( "Store", "S.Longitude","S.Latitude")    # rename stores variables
StoreLocations_LL <- data.table(StoreLocations_LL) # from dataframe to datatable

We use the same technique to convert customer location coordinates.

# Create a CustomerLocations data table like we did for StoreLocations
CustomerLocations <- LaptopSales[, list(Customer.Postcode, customer.X, customer.Y)]
CustomerLocations <- CustomerLocations[!duplicated(CustomerLocations),]

# Create SpatialPointsDataframe
CustomerLocations_SP <- SpatialPointsDataFrame( 
  data = data.frame(CustomerLocations$Customer.Postcode), # data, customer names (postcodes)
  coords = CustomerLocations[, c("customer.X", "customer.Y")], # coordinates, x for "Easting", y for "Northing"
  proj4string = CRS( "+init=epsg:27700" ) ) # proj4string of the coordinates, assign CRS to data

# Transform coordinates
CustomerLocations_SP_LL <- spTransform(CustomerLocations_SP, CRS("+init=epsg:4326"))

# Transform to datatable
CustomerLocations_LL <- data.frame(CustomerLocations_SP_LL)[,c(1:3)]      # keep first 3 columns
colnames(CustomerLocations_LL) <- c( "Customer", "C.Longitude","C.Latitude")    # rename customer variables
CustomerLocations_LL <- data.table(CustomerLocations_LL) # from dataframe to datatable

Now, we can plot the store and customer coordinates on the map we created.

# Plotting the map we created, adding customer points, then store points and adjusting labels of the plot.
ggmap(map_os) +
  geom_point(aes(x = C.Longitude, y = C.Latitude, stroke = 0.2, color = "Customer"), data = CustomerLocations_LL) +
  geom_point(aes(x = S.Longitude, y = S.Latitude, stroke = 1, color = "Store"), data = StoreLocations_LL) +
  labs( x = "Longitude", y = "Latitude", color = "")

We see that most customers live around London and stores are placed accordingly: most stores are near London’s center, with a few stores evenly placed in the outskirts.

2.2 Which stores are selling the most?

We compute the total sales of each store and plot them with respect to the mean retail price of each store. This will allow us to have an overview of the total sales and see if there is a relationship with the mean retail prices of laptops.

StoreLaptopSales <- StoreLaptopSales[, StoreSales := sum(Retail.Price), by = Store.Postcode] # computing the sum of all sales by store
StoreBarplotData <- merge(StoreBarplotData, unique(StoreLaptopSales[, list(Store.Postcode, StoreSales)]), by = "Store.Postcode") # merging with the StoreBarplotData table to get the retail price means by store
setkey(StoreBarplotData, StoreMean) # ordering StoreBarplotData by mean retail price

# Plotting a bar chart with similar aesthetic options as before. 
ggplot(StoreBarplotData, aes(x = CharStoreMean, y = StoreSales, fill = CharStoreMean)) +
  geom_bar(stat = "identity") + theme_light() +
  theme(axis.text.x = element_blank()) +
  labs(x = "Stores", y = "Total store sales in 2008", fill = "Annual mean") +
  scale_y_continuous(limits = c(0, 12500000), breaks = seq(0, 12500000, 1000000), labels = ~ format(.x, scientific = FALSE))

There seems to be no relationship between mean retail prices and total store sales. Four of the five stores with low means have low total sales, but the fifth is the highest-grossing store of the year.

We plot the bar chart by increasing order of total sales to get a better idea of the distribution:

# Plotting a bar chart with similar aesthetic options as before. We use reorder() to arrange the bars by size. 
ggplot(StoreBarplotData, aes(x =reorder(CharStoreMean, StoreSales), y = StoreSales, fill = CharStoreMean)) +
  geom_bar(stat = "identity") + theme_light() +
  theme(axis.text.x = element_blank()) +
  labs(x = "Stores", y = "Total store sales in 2008", fill = "Annual mean") +
  scale_y_continuous(limits = c(0, 12500000), breaks = seq(0, 12500000, 1000000), labels = ~ format(.x, scientific = FALSE))

We can see that stores are split into four discernible total sales categories:

  • First six stores range between 500’000 £ and 1’500’000 £
  • Following four stores range between 3’000’000 £ and 4’500’000 £
  • Following two stores sit around 8’000’000 £
  • Last three stores range between 11’000’000 £ and 12’500’000 £

There is a considerable gap between each category.

We plot each store on the map, using color to distinguish between categories and see how they are laid out.

# We plot stores on the map like before, using breaks to determine the colors of each category.
ggmap(map_os) +
  geom_point(aes(x = S.Longitude, y = S.Latitude, stroke = 1, color = cut(StoreBarplotData$StoreSales, breaks = c(0, 2000000, 6000000, 9000000, 13000000))), data = StoreLocations_LL) +
  labs( x = "Longitude", y = "Latitude", color = "Total sales categories") + scale_color_discrete(labels = c("500'000 - 1'500'000", "3'000'000 - 4'500'000", "8'000'000", "11'000'000 - 12'500'000"))

The result is very clear. Unsurprisingly, the highest total sales are made by stores in central London (purple). Conversely, the six stores on the outskirts of London have the lowest total sales (salmon). Between the two extremes lies a ring formed by the second and third categories (green and blue). It would be interesting to investigate further how wealth is distributed between the northeastern and southwestern suburbs, or whether other factors influence this difference in total sales.

2.3 How far would customers travel to buy a laptop?

Let us create a table containing the postcodes and coordinates of stores and customers for each transaction. We eliminate duplicate rows from this table: if the same customer visits the same store multiple times, the distance is always the same, so the repeated rows are irrelevant when computing or plotting distances.

sum(duplicated(CustomerLocations$Customer.Postcode)) # this sum is 0, which means no two customers share a postcode. In other words, the postcode can act as a unique identifier for any customer.
## [1] 0
Distances <- LaptopSales[, list(Customer.Postcode, Store.Postcode)] # Keeping only postcodes from the LaptopSales data
Distances <- Distances[!duplicated(Distances),] # getting rid of duplicate rows
Distances <- merge(Distances, StoreLocations_LL, by.x = "Store.Postcode", by.y = "Store", all.x = TRUE) # merging with StoreLocations_LL to get longitude and latitude of stores
Distances <- merge(Distances, CustomerLocations_LL, by.x = "Customer.Postcode", by.y = "Customer", all.x = TRUE) # merging with CustomerLocations_LL to get longitude and latitude of customers
Distances = Distances[, list(Store.Postcode, S.Longitude, S.Latitude, Customer.Postcode, C.Longitude, C.Latitude)] # ordering columns

Let us plot a store and all of its customers on the map. We defined a global parameter of this R Markdown file, called store_post_code, at the beginning of the document. By default, the selected postcode is SW1P 3AU. By changing the value of store_post_code at the beginning of the document and knitting the file, we can plot any of the 15 stores and their respective customers.

DistancesChosenStore <- Distances[Store.Postcode %in% params$store_post_code,]
ChosenStoreLocation <- StoreLocations_LL[Store %in% params$store_post_code,]

ggmap(map_os) +
  geom_point(aes(x = C.Longitude, y = C.Latitude, color = "Customers"), data = DistancesChosenStore) +
  geom_point(aes(x = S.Longitude, y = S.Latitude, color = "Store"), data = ChosenStoreLocation) +
  labs( x = "Longitude", y = "Latitude", color = "")

We can also compute the means and medians of customer coordinates by store, which, once plotted, give us an idea of where the prototypical customer of each store lives.

# Computing means and medians of longitudes and latitudes, by store:
Distances <- Distances[, MeanLong := mean(C.Longitude), by = Store.Postcode]
Distances <- Distances[, MeanLat := mean(C.Latitude), by = Store.Postcode]
Distances <- Distances[, MedianLong := median(C.Longitude), by = Store.Postcode]
Distances <- Distances[, MedianLat := median(C.Latitude), by = Store.Postcode]

# creating a data table with the columns relevant for plotting. Using unique() to get rid of duplicate rows (we thus have 15 rows, one per store).
ProtoCustLoc <- unique(Distances[, list(Store.Postcode, S.Longitude, S.Latitude, MeanLong, MeanLat, MedianLong, MedianLat)])

# Plotting on the map: store coordinates, mean coordinates, median coordinates, and two segments linking the mean and median to their store.
ggmap(map_os) +
  geom_point(aes(x = MeanLong, y = MeanLat, color = "Coordinates Mean"), data = ProtoCustLoc) +
  geom_point(aes(x = MedianLong, y = MedianLat, color = "Coordinates Median"), data = ProtoCustLoc) +
  geom_point(aes(x = S.Longitude, y = S.Latitude, color = "Store"), data = ProtoCustLoc) +
  geom_segment(data = ProtoCustLoc, aes(x = S.Longitude, xend = MeanLong, y = S.Latitude, yend = MeanLat)) +
  geom_segment(data = ProtoCustLoc, aes(x = S.Longitude, xend = MedianLong, y = S.Latitude, yend = MedianLat)) +
  labs( x = "Longitude", y = "Latitude", color = "")

We can see that, in general, the median customer location lies further from the store than the mean, with a few exceptions. Some stores have very close mean and median coordinates, while for others, especially those far from central London, the prototypical customer lives further from the store.

2.4 Try an alternative way of looking at how far customers traveled. Do this by creating a new data column that computes the distance between customer and store.

We compute the distance between each customer and the stores they visited using distGeo from the geosphere package. distGeo computes the shortest distance between two points on an ellipsoid, using by default the WGS84 ellipsoid, a standard geodesic model of the Earth.

Distances <- Distances[, list(Store.Postcode, S.Longitude, S.Latitude, Customer.Postcode, C.Longitude, C.Latitude)] # keeping only relevant columns
Distances <- Distances[, Dist := distGeo(matrix(c(S.Longitude, S.Latitude), ncol = 2), matrix(c(C.Longitude, C.Latitude), ncol = 2))] # using distGeo to compute the distance (result is in meters).
Distances <- merge(Distances, StoreBarplotData, by = "Store.Postcode") # merging with StoreBarplotData to get the StoreSales column
Distances <- Distances[, list(Store.Postcode, S.Longitude, S.Latitude, Customer.Postcode, C.Longitude, C.Latitude, Dist, StoreSales)] # taking out irrelevant columns after merging
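As a quick sanity check on distGeo(), here is a minimal sketch on two hand-picked points roughly in central London (the coordinates are illustrative, not taken from the data; note the longitude-first order distGeo() expects):

```r
library(geosphere)

# Two nearby points at the same latitude, given as (lon, lat)
p1 <- c(-0.1278, 51.5074)
p2 <- c(-0.1000, 51.5074)

d <- distGeo(p1, p2)  # shortest distance on the WGS84 ellipsoid, in meters
round(d)              # about 1.9 km
```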

First, let us look at the overall range of Dist, as well as its mean and median. We also draw a boxplot of Dist.

# computing min, median, mean, max
minDist = min(Distances$Dist)
medianDist = median(Distances$Dist)
meanDist = mean(Distances$Dist)
maxDist = max(Distances$Dist)

print(minDist)
## [1] 0
print(medianDist)
## [1] 3367.087
print(meanDist)
## [1] 3675.49
print(maxDist)
## [1] 19895.99
#plotting the boxplot for Dist
boxplot(Distances$Dist, xlab = "Distance from Customer to Store", ylab = "Distance in meters")

However, it is important to see how the distances are distributed by store. We will now plot individual boxplots in total sales ascending order.

# plotting boxplots for distances by store. The stores are ordered by ascending total sales.
ggplot(Distances, aes(x = reorder(Store.Postcode, StoreSales), y = Dist)) + geom_boxplot() +
  theme_light() + labs(x = "Stores", y = "Distance in meters") + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

We can clearly see that the five stores with the highest total sales, which were also the most central ones (the purple and blue categories), all have boxplots with low median distances and small spreads (the boxes are smaller).

3 Revenue questions

3.1 How do the sales volume in each store relate to Acell’s revenues?

TotalSales = sum(StoreBarplotData$StoreSales)

The total sales of all stores amount to 72’177’609 £. This should represent the largest part of Acell’s revenues.

We now plot a Pareto chart of the sales by store:

setkey(StoreBarplotData, StoreSales) # ordering the data wrt StoreSales
options(scipen=999) # disabling scientific notation for clarity

# plotting the pareto chart:
pareto.chart(StoreBarplotData$StoreSales, main = "Pareto Chart for Store Sales", las = 0) 

##    
## Pareto chart analysis for StoreBarplotData$StoreSales
##            Frequency        Cum.Freq.       Percentage     Cum.Percent.
##   O 12241005.0000000 12241005.0000000       16.9595601       16.9595601
##   N 11589855.0000000 23830860.0000000       16.0574105       33.0169707
##   M 11300615.0000000 35131475.0000000       15.6566769       48.6736475
##   L  8182115.0000000 43313590.0000000       11.3360849       60.0097324
##   K  8010470.0000000 51324060.0000000       11.0982756       71.1080080
##   J  4285700.0000000 55609760.0000000        5.9377140       77.0457220
##   I  3651725.0000000 59261485.0000000        5.0593599       82.1050819
##   H  3299155.0000000 62560640.0000000        4.5708843       86.6759662
##   G  3091830.0000000 65652470.0000000        4.2836415       90.9596077
##   F  1544783.0000000 67197253.0000000        2.1402524       93.0998601
##   E  1286101.0000000 68483354.0000000        1.7818559       94.8817160
##   D  1239960.0000000 69723314.0000000        1.7179289       96.5996449
##   C  1124345.0000000 70847659.0000000        1.5577476       98.1573926
##   B   741811.0000000 71589470.0000000        1.0277578       99.1851503
##   A   588139.0000000 72177609.0000000        0.8148497      100.0000000

We can see from the chart and the data returned by pareto.chart() that the five stores with the highest sales each account for between 11% and 17% of Acell’s sales revenues, together amounting to almost 75%.

In other words, one third of the stores brings in 75% of the sales revenues!
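The cumulative percentages reported by pareto.chart() can be reproduced directly: a Pareto chart is just the sales sorted in decreasing order with a running cumulative share. A minimal sketch on made-up sales figures for five stores:

```r
# invented sales figures, for illustration only
sales <- c(O = 120, N = 110, M = 100, B = 40, A = 30)
sales <- sort(sales, decreasing = TRUE)

# cumulative percentage of total sales, as in the Cum.Percent. column
cumPct <- 100 * cumsum(sales) / sum(sales)
print(round(cumPct, 1))
```

Reading off the entry where cumPct crosses a threshold (say 75%) identifies the small set of stores driving most of the revenue.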

3.2 How does this relationship depend on the configuration?

We had previously identified six configuration levels. We will now compute the total sales by configuration and plot them, while still outlining our six groups by color.

ConfigLaptopSales <- ConfigLaptopSales[, ConfigSales := sum(Retail.Price), by = Configuration] # computing the sum of sales by configuration
ConfigRevenues <- unique(ConfigLaptopSales[, list(Configuration, ConfigSales)]) # creating ConfigRevenues with only two columns, getting rid of duplicate rows
setkey(ConfigRevenues, Configuration) # ordering ConfigRevenues by configuration number

# plotting total sales with respect to configuration number
ggplot(ConfigRevenues, aes(x = Configuration, y = ConfigSales, color = cut(Configuration, breaks = c(0, 144, 288, 432, 576, 720, 864)))) + geom_point() +
  theme_light() + 
  scale_x_continuous(limits = c(0, 864), breaks = seq(0, 864, 100)) +
  labs(x = "Configuration Number", y = "Total sales", color ="Configuration levels")

We see that each configuration group splits into distinct bands of total sales.

All groups have a bottom band of configurations with low total sales (roughly under 60’000 £), then a gap, followed by a second band with higher total sales (roughly between 60’000 and 125’000 £).

Only the first three configuration groups (with numbers smaller than or equal to 432) have a third, top band containing the highest-grossing configurations (roughly between 125’000 and 200’000 £).

Let us split the configurations according to their total sales, using the 60’000 £ and 125’000 £ boundaries we estimated from the graph.

We compute the number of configurations in each tier and their respective sum of sales. We then compute the ratios between the sales of each tier and the overall total sales, and compare them to the ratio between the number of configurations in the tier and the overall total configurations.

# Computing the number of configurations in each tier
ConfigNumberTop <- nrow(ConfigRevenues[ConfigSales >= 125000,])
ConfigNumberMid <- nrow(ConfigRevenues[ConfigSales < 125000 & ConfigSales >= 60000,])
ConfigNumberBot <- nrow(ConfigRevenues[ConfigSales < 60000,])

# Computing the sum of sales in each tier
ConfigRevenuesTop <- sum(ConfigRevenues[ConfigSales >= 125000,ConfigSales])
ConfigRevenuesMid <- sum(ConfigRevenues[ConfigSales < 125000 & ConfigSales >= 60000,ConfigSales])
ConfigRevenuesBot <- sum(ConfigRevenues[ConfigSales < 60000,ConfigSales])

# creating a data table with the ratios for sales and configurations
ConfigPercents <- as.data.table(list(c(ConfigRevenuesTop, ConfigRevenuesMid, ConfigRevenuesBot)/TotalSales, c(ConfigNumberTop, ConfigNumberMid, ConfigNumberBot)/nrow(ConfigRevenues)))

colnames(ConfigPercents) <- c("SalesPercentages", "ConfigPercentages") # adding column names
ConfigPercents <- ConfigPercents[, Tier:= c("Top tier", "Mid tier", "Bottom tier")] # adding row with tier names
ConfigPercents <- ConfigPercents[, list(Tier, SalesPercentages, ConfigPercentages)] # reordering columns

kable_styling(kable(ConfigPercents, col.names = c("Tier", "Ratio of Total Sales", "Ratio of Configurations"), align = "lcc"), full_width = FALSE) # displaying the result as a table
Tier Ratio of Total Sales Ratio of Configurations
Top tier 0.2278859 0.1215278
Mid tier 0.5886451 0.5381944
Bottom tier 0.1834690 0.3402778

We can see that we indeed have a high-performing Top Tier, as 12% of all configurations bring in 23% of sales revenues.

We also have a low performing Bottom Tier: 34% of configurations bring only 18% of sales revenues.

The Mid Tier percentages are quite close: 54% of configurations bring 59% of sales revenues, so there is no imbalance in this group.

This might be because the group is quite aggregated; splitting it further could perhaps reveal hidden patterns. However, these three tiers are the ones that emerge naturally when plotting the data.

4 Configuration questions

4.1 What are the details of each configuration? How does this relate to price?

We have several configuration parameters:

  • Battery life in hours
  • Bundled applications (Yes/No)
  • HD size in GB
  • Integrated Wireless (Yes/No)
  • Processor speed in GHz
  • RAM in GB
  • Screen size in inches

We begin by plotting these parameters with respect to the configuration number. For the Yes/No variables, we attribute 1/0 values.

ConfigDetail <- unique(LaptopSales[, list(Retail.Price, Configuration, ScreenSizeIn = Screen.Size..Inches., BatteryLifeH = Battery.Life..Hours., RAM.GB = RAM..GB., ProcessorSpeedsGHz = Processor.Speeds..GHz., IntegratedWireless = Integrated.Wireless., HDSizeGB = HD.Size..GB., BundledApps = Bundled.Applications.)]) # selecting relevant columns and renaming them
setkey(ConfigDetail, Configuration) # ordering by Configuration

# transforming Yes/No in 1/0 values
ConfigDetail$IntegratedWireless<-ifelse(ConfigDetail$IntegratedWireless == "Yes",1,0)
ConfigDetail$BundledApps<-ifelse(ConfigDetail$BundledApps == "Yes",1,0)

# creating a data table with fewer columns using pivot_longer() to allow facet plotting
ConfigPlot <- pivot_longer(data = ConfigDetail[, list(Configuration, ScreenSizeIn, BatteryLifeH, RAM.GB, ProcessorSpeedsGHz, IntegratedWireless, HDSizeGB, BundledApps)], cols = !Configuration, names_to = "Parameters")
# plotting with facet to allow multiple plots with the same x axis
ggplot(ConfigPlot, aes(x = Configuration, y = value)) +
  geom_point(size = 0.5) + theme_light() + facet_grid(Parameters ~ ., scales = "free_y") + labs(x = "Configuration number", y = "")
Plots of Battery life in hours, Bundled applications (Yes/No), HD size in GB, Integrated Wireless (Yes/No), Processor speed in GHz, RAM in GB and Screen size in inches, all with respect to Configuration number

We can see that the different customization parameters follow clear patterns with respect to the configuration numbers. Indeed, each configuration number (there are 864 in total) represents a unique combination of these parameters.

We can display the following table of parameters and their respective values:

# we create the table
ConfigParams <- data.table(Parameters = c("ScreenSizeIn", "BatteryLifeH", "RAM.GB", "ProcessorSpeedsGHz", "IntegratedWireless", "HDSizeGB", "BundledApps"), values = c("15 - 17", "4 - 5 - 6", "1 - 2 - 4", "1.5 - 2.0 - 2.5", "Yes/No", "40 - 80 - 120 - 300", "Yes/No"), valuesNumber = c(2, 3, 3, 3, 2, 4, 2))

#printing the table on R markdown
kable_styling(kable(ConfigParams, col.names = c("Parameter", "Values of parameter", "Number of values"), align = "llc"), full_width = FALSE)
Parameter Values of parameter Number of values
ScreenSizeIn 15 - 17 2
BatteryLifeH 4 - 5 - 6 3
RAM.GB 1 - 2 - 4 3
ProcessorSpeedsGHz 1.5 - 2.0 - 2.5 3
IntegratedWireless Yes/No 2
HDSizeGB 40 - 80 - 120 - 300 4
BundledApps Yes/No 2

All possible combinations of these parameters can then be counted by multiplying the values of the third column together: 2 × 3 × 3 × 3 × 2 × 4 × 2 = 864, which is indeed our total number of configurations.
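This multiplication can be double-checked by enumerating every combination of the parameter values listed in the table with expand.grid():

```r
# one row per combination of the parameter values from the table above;
# nrow() should equal 2 * 3 * 3 * 3 * 2 * 4 * 2 = 864
allConfigs <- expand.grid(
  ScreenSizeIn       = c(15, 17),
  BatteryLifeH       = c(4, 5, 6),
  RAM.GB             = c(1, 2, 4),
  ProcessorSpeedsGHz = c(1.5, 2.0, 2.5),
  IntegratedWireless = c("Yes", "No"),
  HDSizeGB           = c(40, 80, 120, 300),
  BundledApps        = c("Yes", "No")
)
nrow(allConfigs)
## [1] 864
```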

Let us now plot the sale prices of laptops with respect to each configuration parameter. We get the seven following plots.

# seven similar plots, with the same options as before
ggplot(ConfigDetail, aes(x = ScreenSizeIn, y = Retail.Price)) +
  geom_point() + theme_light() + labs(x = "Screen size in inches", y = "Retail price of laptop")

ggplot(ConfigDetail, aes(x = BatteryLifeH, y = Retail.Price)) +
  geom_point() + theme_light() + labs(x = "Battery life in hours", y = "Retail price of laptop")

ggplot(ConfigDetail, aes(x = RAM.GB, y = Retail.Price)) +
  geom_point() + theme_light() + labs(x = "RAM in GB", y = "Retail price of laptop")

ggplot(ConfigDetail, aes(x = ProcessorSpeedsGHz, y = Retail.Price)) +
  geom_point() + theme_light() + labs(x = "Processor Speed in GHz", y = "Retail price of laptop")

ggplot(ConfigDetail, aes(x = IntegratedWireless, y = Retail.Price)) +
  geom_point() + theme_light() + labs(x = "Integrated Wireless (Yes = 1, No = 0)", y = "Retail price of laptop")

ggplot(ConfigDetail, aes(x = HDSizeGB, y = Retail.Price)) +
  geom_point() + theme_light() + labs(x = "HD size in GB", y = "Retail price of laptop")

ggplot(ConfigDetail, aes(x = BundledApps, y = Retail.Price)) +
  geom_point() + theme_light() + labs(x = "Bundled applications (Yes = 1, No = 0)", y = "Retail price of laptop")

We can see that increasing any of the configuration parameters also increases the retail price.
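One way to quantify these effects would be a linear regression of price on the parameters. A minimal sketch on simulated data (the price formula below is invented for illustration and is not fitted to LaptopSales):

```r
set.seed(1)

# simulated configurations; coefficients 50, 0.5 and 20 are made up
n   <- 500
sim <- data.frame(
  RAM.GB       = sample(c(1, 2, 4), n, replace = TRUE),
  HDSizeGB     = sample(c(40, 80, 120, 300), n, replace = TRUE),
  BatteryLifeH = sample(4:6, n, replace = TRUE)
)
sim$Retail.Price <- 300 + 50 * sim$RAM.GB + 0.5 * sim$HDSizeGB +
  20 * sim$BatteryLifeH + rnorm(n, sd = 10)

# a linear model recovers the per-parameter price contributions
fit <- lm(Retail.Price ~ RAM.GB + HDSizeGB + BatteryLifeH, data = sim)
round(coef(fit), 1)
```

Run on ConfigDetail with all seven parameters, the fitted coefficients would estimate how much each parameter (e.g. one extra GB of RAM) adds to the retail price.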

4.2 Do all stores sell all configurations?

We use the store_post_code parameter introduced before. This allows us to compute the number of missing configurations for any store.

ConfigsChosenStore <- ConfigLaptopSales[Store.Postcode %in% params$store_post_code,] # selecting the chosen store only
ConfigsChosenStore <- unique(ConfigsChosenStore[, list(Store.Postcode, Configuration)]) # listing all configurations in the chosen store, removing duplicates

# checking if the number of rows is equal to the configurations total number (864), computing the difference if the number isn't equal and printing appropriate messages
if (nrow(ConfigsChosenStore) == 864) {
  print("All configurations are available at this store")
} else {
  print("Number of missing configurations:")
  print(864-nrow(ConfigsChosenStore))
}
## [1] "All configurations are available at this store"

By testing all the different stores, we can deduce that the four stores with the highest total sales, SW1P 3AU, SE1 2BN, SW1V 4QQ and NW5 2QH, have all configurations available.

Conversely, stores with low total sales also have a reduced selection of configurations, with between 76 (store CR7 8LE) and 270 (store N3 1DH) configurations unavailable.
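Rather than testing stores one at a time via params$store_post_code, the number of missing configurations could be computed for all stores at once with data.table's uniqueN(). A minimal sketch on toy data (nConfigs would be 864 for the real data):

```r
library(data.table)

# toy transactions: store X1 sells all 3 configurations of a fictitious
# 3-configuration catalogue, store Y2 only 2 of them
toy <- data.table(
  Store.Postcode = c("X1", "X1", "X1", "Y2", "Y2"),
  Configuration  = c(1, 2, 3, 1, 3)
)
nConfigs <- 3  # 864 on the real data

# number of unavailable configurations per store
missingByStore <- toy[, .(Missing = nConfigs - uniqueN(Configuration)),
                      by = Store.Postcode]
print(missingByStore)
```

Applied to ConfigLaptopSales, sorting the result by Missing would reproduce the range from 0 (the four fully stocked stores) up to 270 in a single table.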